AITopics | supervisory signal

The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, it is labor intensive to temporally annotate audio and visual events and thus hampers the learning of a parsing model. To this end, we propose to explore additional cross-video and cross-modality supervisory signals to facilitate weakly-supervised audio-visual video parsing. The proposed method exploits both the common and diverse event semantics across videos to identify audio or visual events. In addition, our method explores event co-occurrence across audio, visual, and audio-visual streams.

cross-modality signal, name change, weakly-supervised audio-visual video parsing, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.95)

Add feedback

Exploiting weakly supervised visual patterns to learn from partial annotations

Neural Information Processing SystemsDec-23-2025, 17:31:35 GMT

As classifications datasets progressively get larger in terms of label space and number of examples, annotating them with all labels becomes non-trivial and expensive task. For example, annotating the entire OpenImage test set can cost $6.5M. Hence, in current large-scale benchmarks such as OpenImages and LVIS, less than 1\% of the labels are annotated across all images. Standard classification models are trained in a manner where these un-annotated labels are ignored. Ignoring these un-annotated labels result in loss of supervisory signal which reduces the performance of the classification models. Instead, in this paper, we exploit relationships among images and labels to derive more supervisory signal from the un-annotated labels. We study the effectiveness of our approach across several multi-label computer vision benchmarks, such as CIFAR100, MS-COCO panoptic segmentation, OpenImage and LVIS datasets. Our approach can outperform baselines by a margin of 2-10% across all the datasets on mean average precision (mAP) and mean F1 metrics.

exploiting weakly supervised visual pattern, name change, partial annotation, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval Shijie Wang

Neural Information Processing SystemsNov-19-2025, 20:08:31 GMT

Though important, attribute modeling usually requires significant manual annotations and thus is labor-intensive.

artificial intelligence, machine learning, retrieval model, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(9 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Zhang, Mengtan, Guo, Zizhan, Zhao, Hongbo, Feng, Yi, Xiong, Zuyi, Wang, Yue, Du, Shaoyi, Wang, Hanli, Fan, Rui

arXiv.org Artificial IntelligenceNov-4-2025

Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.

artificial intelligence, estimation, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2511.01502

Country: Asia > China (0.47)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

066ca7bf90807fcd8e4f1eaef4e4e8f7-Paper.pdf

Neural Information Processing SystemsOct-9-2025, 13:05:14 GMT

artificial intelligence, dataset, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Industry: Leisure & Entertainment > Sports > Tennis (0.47)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Add feedback

Reflective Paper-to-Code Reproduction Enabled by Fine-Grained Verification

Zhou, Mingyang, Yao, Quanming, Du, Lun, Wei, Lanning, Zheng, Da

arXiv.org Artificial IntelligenceAug-26-2025

Reproducing machine learning papers is essential for scientific progress but remains challenging for both humans and automated agents. Existing agent-based methods often struggle to fully and accurately reproduce implementation details such as mathematical formulas and algorithmic logic. Previous studies show that reflection with explicit feedback improves agent performance. However, current paper reproduction methods fail to effectively adopt this strategy. This gap mainly arises from the diverse paper patterns, complex method modules, and varied configurations encountered in research papers. Motivated by how humans use systematic checklists to efficiently debug complex code, we propose \textbf{RePro}, a \textbf{Re}flective Paper-to-Code \textbf{Repro}duction framework that automatically extracts a paper's fingerprint, referring to a comprehensive set of accurate and atomic criteria serving as high-quality supervisory signals. The framework first generates code based on the extracted information, and then leverages the fingerprint within iterative verification and refinement loop. This approach systematically detects discrepancies and produces targeted revisions to align generated code with the paper's implementation details. Extensive experiments on the PaperBench Code-Dev benchmark have been conducted, RePro achieves 13.0\% performance gap over baselines, and it correctly revises complex logical and mathematical criteria in reflecting, on which the effectiveness is obvious.

criteria, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.16671

Country: Asia (0.46)

Genre: Research Report > New Finding (0.46)

Technology: